Abstract 

This paper is concerned with a generic class of predictive speech coders that includes 
the newly proposed Self Excited Vocoder (SEV) [5] and the well known Code Excited 
Linear Predictive Coder (CELPC) [6]. All members of this class form an excitation sequence 
for a linear predictive model filter using the same general model for the excitation signal. 
The general excitation model is based on a block coding technique where each sequence is 
drawn from an ensemble of sequences. This paper reports on two developments related to 
this general model. The first development is a new type of excitation ensemble that can in 
general be populated by many different types of sequences. The second development is a 
means of populating this new type of ensemble based on a vector quantizer design procedure 
using a new distortion measure. 


1 Introduction 

A general model for the excitation signal in linear predictive speech coders was originally 
presented in [5]. Formal subjective tests, summarized in [4], characterized the performance 
of selected coders in this general class of predictive speech coders. A Self Excited Vocoder 
has been implemented in real time on a single circuit board using the AT&T DSP32 floating 
point digital signal processing devices [1]. This implementation will serve as a prototype 
vocoder in the NASA sponsored Mobile Satellite Communications Project. 

This paper presents a new approach to the excitation modeling problem in self excited 
and code excited vocoders. The paper begins by reviewing the general model for the excita- 
tion signal in this class of predictive speech coders, and introduces a new type of excitation 
ensemble. Then a procedure for populating the excitation ensemble, based on an iterative 
vector quantizer design algorithm, is discussed. Finally, in the last section, a new distance 
measure for the vector quantization procedure is introduced. 


2 A New Class of Excitation Ensembles 

The general model for the excitation signal in this class of coders is described by the block 
diagram in Figure 1a. The excitation signal, $e[n]$, is a linear combination of component 
excitation sequences, $e_k[n]$, where the kth sequence is chosen from the associated excitation 
ensemble, $\mathcal{F}_k$. An excitation ensemble is simply a collection of discrete functions, $f_\gamma[n]$, 
indexed in sample space by $\gamma$ and indexed in time by $n$. The optimum ensemble index, 
$\gamma_k$, and gain, $\beta_k$, associated with the kth excitation sequence are found by exhaustively 
searching through the excitation ensemble, $\mathcal{F}_k$, for the ensemble function that minimizes a 
weighted mean squared error [5]. 
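
The exhaustive search can be illustrated with a minimal sketch. It assumes that each candidate sequence is passed through a finite impulse response `h` standing in for the cascaded synthesis and weighting filters, and that the gain for each candidate is the closed form least-squares gain. The function names and the closed form gain are illustrative assumptions, not the procedure of [5].

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out


def search_ensemble(y, h, ensemble):
    """Return (index, gain, error) of the candidate that best matches y.

    y        : weighted target signal, length N + L - 1
    h        : impulse response of the filter cascade, length L (assumed FIR)
    ensemble : list of candidate excitation sequences, each of length N
    """
    best = None
    for index, f in enumerate(ensemble):
        s = convolve(f, h)                       # filtered candidate
        energy = sum(v * v for v in s)
        if energy == 0.0:
            continue                             # an all-zero candidate has no useful gain
        # Least-squares gain for this candidate (assumed closed form).
        gain = sum(a * b for a, b in zip(y, s)) / energy
        error = sum((a - gain * b) ** 2 for a, b in zip(y, s))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Every candidate is visited once, so the cost of the search grows linearly with the size of the ensemble.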



Figure 1: a) Model of the excitation signal for a generic class of predictive speech 
coders, b) Ensemble search interpretation of a single tap long-term predictor. 


Examples of some existing predictive coders can be identified if the excitation ensemble 
is constrained to contain a particular class of sequences. For example, the well known 
CELPC chooses an optimum excitation sequence from a stochastic ensemble, where each 
ensemble sequence is populated by Gaussian random variates [6]. Figure 1b shows how a 
simple long-term predictor can be interpreted as a time-varying excitation ensemble. In 
this case, the ensemble is the memory of a long-term predictor, whose predictor delay can 
vary over the expected range of a pitch period in speech. Each ensemble sequence is formed 
by sliding an N point rectangular window along the memory of the long-term predictor. 
The optimum ensemble sequence corresponds to an N point sequence beginning at sample 
$-\gamma$ in the memory of the long-term predictor. This type of ensemble, referred to here as the 
“self excitation” ensemble, forms the basis for the SEV. After a brief period of initialization, 
the SEV derives its excitation signal, $e[n] = \beta e[n - \gamma]$, solely from this type of ensemble. 
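
The sliding window construction of the self excitation ensemble can be sketched as follows. The sketch assumes the predictor memory is a list of past excitation samples with the most recent sample last, and that every delay is at least one frame long so that each window lies entirely inside the memory. The function name and arguments are hypothetical.

```python
def self_excitation_ensemble(memory, frame_len, min_delay, max_delay):
    """Map each delay gamma to the N point sequence e[n - gamma], n = 0..N-1.

    memory    : past excitation samples, memory[-1] is the most recent one
    frame_len : N, the excitation analysis frame length
    min_delay, max_delay : range of predictor delays to include (assumed inputs)
    """
    if min_delay < frame_len:
        raise ValueError("delays shorter than one frame would read future samples")
    if max_delay > len(memory):
        raise ValueError("delay exceeds the available predictor memory")
    ensemble = {}
    for gamma in range(min_delay, max_delay + 1):
        start = len(memory) - gamma              # position of sample -gamma
        ensemble[gamma] = memory[start:start + frame_len]
    return ensemble
```

Because the memory is refreshed with each coded frame, the ensemble returned by this function changes from frame to frame, which is the sense in which it is time-varying.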

The flexibility of the most general model of the excitation signal is derived from the fact 
that it imposes no structure on the functions contained in the excitation ensemble. From the 
model definition, there is no fundamental requirement that an excitation ensemble be homo- 
geneous. Thus, a single excitation ensemble can contain more than one class of sequences. 
For example, an ensemble can be formed by combining a set of time-varying sequences 
chosen from the memory of a long-term predictor with a set of fixed Gaussian random 
sequences. Figure 2a illustrates an interpretation of a simple coder whose excitation is derived 
from this type of ensemble. While the figure suggests that a hard classification procedure is 
taking place, this is actually not the case. The ensemble search procedure chooses a single 
sequence from the entire ensemble, so the determination of which class of sequences is used 
is made by choosing the single sequence which results in the least measured distortion. This 
type of excitation ensemble will be referred to as a nonhomogeneous ensemble, and can, in 
general, contain many different classes of sequences. The particular ensemble illustrated by 
the block diagram in Figure 2a is described by the excitation signal, $e[n] = \beta z_\gamma[n]$, where 

$$
z_\gamma[n] =
\begin{cases}
v_\gamma[n], & 1 \le \gamma \le C \\
e[n - \gamma], & C < \gamma \le F
\end{cases}
\tag{1}
$$

and the fixed sequences, $v_\gamma[n]$, may be populated in many different ways. 

Figure 2: a) An interpretation of a simple nonhomogeneous predictive speech coder, b) 
Block diagram illustrating a procedure for determining the fixed ensemble sequences 
in a nonhomogeneous excitation ensemble. 
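
Equation 1 can be read as an index mapping, as in the following minimal sketch. It assumes the fixed sequences occupy indices 1 through C, that larger indices are delays into the predictor memory, and that C + 1 is at least the frame length so that every delayed window lies inside the memory. The helper names and the seeded Gaussian generator are illustrative assumptions.

```python
import random


def gaussian_sequences(count, frame_len, seed=0):
    """Fixed sequences filled with independent Gaussian variates (illustrative)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(frame_len)] for _ in range(count)]


def nonhomogeneous_ensemble(fixed, memory, frame_len, max_index):
    """Build z_gamma[n] for gamma = 1..max_index.

    fixed     : list of C fixed sequences, used for 1 <= gamma <= C
    memory    : past excitation, memory[-1] most recent, used for gamma > C
    max_index : largest ensemble index (the upper limit in Equation 1)
    """
    C = len(fixed)
    if C + 1 < frame_len:
        raise ValueError("shortest delay must be at least one frame long")
    if max_index > len(memory):
        raise ValueError("largest delay exceeds the predictor memory")
    ensemble = {}
    for gamma in range(1, max_index + 1):
        if gamma <= C:
            ensemble[gamma] = list(fixed[gamma - 1])                  # fixed class
        else:
            start = len(memory) - gamma
            ensemble[gamma] = memory[start:start + frame_len]         # self excitation class
    return ensemble
```

A single search over the returned dictionary then selects between the two classes implicitly, since the winning index alone determines which class was used.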

3 Populating Nonhomogeneous Ensembles 

This section describes a technique for determining the fixed sequences, $v_\gamma[n]$, in Equation 1 
using the vector quantizer design procedure of Linde et al. [2]. Following the reasoning of 
Davidson et al., the distance measure used for the vector quantization procedure can be the 
same weighted mean squared distance used for coding the excitation signal in this class of 
coders [7]. The following discussion describes the vector quantizer design procedure as it 
applies to populating the fixed sequences of the nonhomogeneous excitation ensemble. 

The generalized Lloyd algorithm, originally introduced in [2], is an iterative algorithm 
for designing an optimum vector quantizer by a method of successive approximation. The 
vector quantizer design procedure determines the sequences, $v_\gamma[n]$, $\gamma = 1, \ldots, C$, of 
Equation 1 from the training vectors, $z_i$, $i = 1, \ldots, n$, derived from the original speech. At 
each iteration of the algorithm the training vectors are partitioned into clusters, and cluster 
centroids are computed based on the partitioning of the data. The splitting algorithm of 
Linde et al. is used here to provide the initial cluster centroids. The cluster centroids that 
exist upon termination of the algorithm form the resulting excitation ensemble. 
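
The partition and centroid iteration with splitting initialization can be sketched as below. To keep the example short it uses a plain squared Euclidean distance and the arithmetic mean as centroid, as a stand-in for the weighted measure used in this paper. It also runs a fixed number of iterations instead of testing a distortion threshold. All names and parameter values are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (stand-in for the weighted measure)."""
    return sum((x - w) ** 2 for x, w in zip(a, b))


def mean(vectors):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def lloyd(train, codebook, iterations=20):
    """Alternate nearest-centroid partitioning and centroid updates."""
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for x in train:
            nearest = min(range(len(codebook)), key=lambda c: sq_dist(x, codebook[c]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        codebook = [mean(c) if c else old for c, old in zip(clusters, codebook)]
    return codebook


def design_by_splitting(train, size, eps=0.01):
    """Grow a codebook by repeated splitting; size is assumed to be a power of two."""
    codebook = [mean(train)]
    while len(codebook) < size:
        # Split every centroid into two slightly perturbed copies.
        codebook = [[v + sign * eps for v in c] for c in codebook for sign in (1, -1)]
        codebook = lloyd(train, codebook)
    return codebook
```

For the excitation ensemble, `sq_dist` and `mean` would be replaced by the weighted distance and the corresponding centroid calculation described in the remainder of the paper.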

Figure 3 is a block diagram illustrating the computation of the distortion, $d(z_i, v_\gamma)$, that 
is used for the vector quantizer data set partitioning and clustering procedures. For each 
excitation analysis frame, $i$, the coder represents the residual vector, $r_i$, with an ensemble 
vector, $v_\gamma$. The coder also computes the short-term predictor, $A_i(z)$, and the excitation 
gain, $\beta_i$. The Atal LPC based weighting filter, $W_i(z)$ [6], is used to compute the weighted 
Euclidean distance. The distance between training vector, $z_i$, and ensemble vector, $v_\gamma$, can 
be expressed as 

$$
d(z_i, v_\gamma) = \sum_{n=0}^{N+L-2} \left( y_i[n] - \beta_i \sum_{l=0}^{N-1} v_\gamma[l]\, h_i[n - l] \right)^2, \tag{2}
$$

where $h_i[n]$ is a finite length impulse response approximation to the cascaded synthesis and 
error weighting filters in Figure 3. The length of this impulse response is approximated as $L$ 
samples ($L \approx 10$). The distance calculation in Equation 2 suggests the form of the training 
data required for each excitation frame. To compute this distance for the ith excitation 
frame, the weighted speech $y_i$, the impulse response $h_i$, and the ensemble gain $\beta_i$ must 
all be derived from the input speech. The form of each training vector is then given as 
$z_i = (y_i, h_i, \beta_i)$. Therefore, the training data is derived from the original speech using the 
predictive speech coder itself. The specification of an initial excitation ensemble for this 
coder is necessary for the generation of the training data. 

Figure 3: Block diagram illustrating the distortion measure computation for Equation 2. 
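
The distance of Equation 2 can be evaluated directly, as in the following minimal sketch. It assumes the weighted speech holds N + L - 1 samples, the impulse response holds L samples and is zero elsewhere, and the ensemble vector holds N samples. The function name is hypothetical.

```python
def weighted_distance(y, h, beta, v):
    """Distance of Equation 2 between a training vector (y, h, beta) and v.

    y    : weighted speech, length N + L - 1
    h    : impulse response approximation, length L
    beta : excitation gain for the frame
    v    : ensemble vector, length N
    """
    N, L = len(v), len(h)
    total = 0.0
    for n in range(N + L - 1):                   # n = 0, ..., N + L - 2
        filtered = 0.0
        for l in range(N):
            if 0 <= n - l < L:                   # h is zero outside 0..L-1
                filtered += v[l] * h[n - l]
        total += (y[n] - beta * filtered) ** 2
    return total
```

The inner loop is the convolution of the ensemble vector with the impulse response, so the distance is simply the energy of the difference between the weighted speech and the scaled, filtered ensemble vector.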

In this research, a nonhomogeneous ensemble was generally divided into self excitation 
sequences and alternate sequences. The procedure for populating the alternate sequences 
of the nonhomogeneous ensemble shown in Figure 2a is illustrated by the block diagram 
shown in Figure 2b. The procedure begins by generating the training data, $z_i$, using a 
predetermined set of signals for the alternate sequences. In this research, the alternate 
sequences are populated by independent Gaussian random variates. Once the training data 
has been generated, a classification procedure is used to select a subset of the training data 
to be used as input to the vector quantizer design procedure. This classification procedure 
simply chooses those training vectors where the predictive speech coder provides a poor 
representation of the original speech. Finally, the vector quantizer design procedure pro- 
duces sequences that are used to populate the alternate sequences in the nonhomogeneous 
ensemble. This procedure is described in the next section. 


4 A New Vector Quantizer Distortion Measure 

This section describes a new distance measure for use in the iterative excitation vector 
quantization procedure. The new distance measure follows immediately from Equation 2, 
and results in circularly defined excitation ensemble sequences. The discussion is broken 
into three parts. First, the centroid calculation following from the weighted Euclidean 
distance measure of Equation 2 is described. Second, the shortcomings of this distance 
measure when applied to vector quantization of the excitation signal are discussed. Finally, 
the new distance measure is introduced. 

Determining the centroid for a given cluster of training vectors corresponds to finding 
the sequence, $v_\gamma$, that minimizes an average distortion for the distance measure in 
Equation 2. This average distortion represents an average over all of the training vectors 
belonging to the cluster. Minimizing the average error for a cluster containing $M$ training 
vectors with respect to $v_\gamma[k]$, $k = 0, \ldots, N - 1$, yields the matrix equation 

$$
\sum_{i=1}^{M} q_i = \sum_{i=1}^{M} R_i\, v_\gamma. \tag{3}
$$

The vector $q_i$ is an $N$ length vector where the kth element corresponds to the crosscorrelation 
between the weighted speech and the impulse response for excitation frame $i$, 

$$
q_i[k] = \beta_i \sum_{n=0}^{N+L-2} y_i[n]\, h_i[n - k], \qquad k = 0, \ldots, N - 1. \tag{4}
$$

The matrix $R_i$ is an $N \times N$ Toeplitz matrix where the element in the lth row and kth column 
is given by the impulse response for excitation frame $i$, 

$$
R_i[l, k] = \beta_i^2 \sum_{n=0}^{N-1} h_i[n]\, h_i[n + |l - k|]. \tag{5}
$$

The matrix order, N , corresponds to the length of the excitation analysis frame, which is 
typically about twenty samples. Hence, computing the cluster centroid is not a compu- 
tationally expensive procedure, requiring only the solution of a twentieth order Toeplitz 
matrix equation. 
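
The centroid calculation can be sketched as follows. The sketch assumes that each training vector is a tuple `(y, h, beta)`, that the crosscorrelation vector is formed as in Equation 4, and that the matrix for each frame holds the squared gain times the autocorrelation of the impulse response at lag |l - k|, which is what minimizing Equation 2 gives. For brevity it solves the system with generic Gaussian elimination; a Levinson type recursion could exploit the Toeplitz structure instead. All function names are hypothetical.

```python
def crosscorr(y, h, beta, N):
    """q[k] = beta * sum_n y[n] h[n - k], for k = 0..N-1."""
    L = len(h)
    return [beta * sum(y[n] * h[n - k] for n in range(k, k + L)) for k in range(N)]


def autocorr_matrix(h, beta, N):
    """N x N Toeplitz matrix of beta^2 times the autocorrelation of h."""
    L = len(h)
    r = [beta * beta * sum(h[n] * h[n + lag] for n in range(L - lag)) for lag in range(N)]
    return [[r[abs(l - k)] for k in range(N)] for l in range(N)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def centroid(cluster, N):
    """Solve (sum_i R_i) v = sum_i q_i for a cluster of (y, h, beta) tuples."""
    q_sum = [0.0] * N
    R_sum = [[0.0] * N for _ in range(N)]
    for y, h, beta in cluster:
        q = crosscorr(y, h, beta, N)
        R = autocorr_matrix(h, beta, N)
        for l in range(N):
            q_sum[l] += q[l]
            for k in range(N):
                R_sum[l][k] += R[l][k]
    return solve(R_sum, q_sum)
```

When a cluster contains a single training vector whose weighted speech was generated exactly by some sequence, the solution returned is that sequence, which gives a simple sanity check for an implementation.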

A major shortcoming of the above algorithm concerns the weighted Euclidean distance 
given in Equation 2. By this measure, the distance between two training vectors, where 
both vectors represent very similar excitation signals, may actually be very large. This is 
due to the fact that the excitation analysis window is placed asynchronously with respect 
to any significant events that may occur in the excitation signal. About 24,000 training 
vectors derived from isolated words uttered by a single speaker were used as training data 
for this algorithm. Ensemble sequences derived from this algorithm were used to code a 
short utterance from the same speaker. There was a significant improvement in segmental 
signal-to-noise ratio using this new ensemble over that of a Gaussian ensemble. However, 
the improvement in subjective performance was not significant when judged by the authors 
in informal listening tests. A modification to this procedure is proposed here that reduces 
the dependency of the training vectors on the position of the associated excitation analysis 
frame. The modification to the design procedure results in a redefinition of the distance 
measure and centroid calculation of the vector training algorithm. 

The modification is based on simple permutations of the weighted speech that is used to 
form the training vector $z_i$. The vector valued permutation $\pi_k$ is a $k$ sample circular right 
shift, 

$$
\pi_k(y) = \left( y[N-k],\, y[N-k+1],\, \ldots,\, y[0],\, y[1],\, \ldots,\, y[N-k-1] \right). \tag{6}
$$

By applying one of the permutations, $\{\pi_k : k = 0, \ldots, N - 1\}$, to the training data, similar 
events occurring in different excitation frames may be aligned in time. 
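
A k sample circular right shift of an N point frame can be written in one line, as in this minimal sketch. The function name is hypothetical, and the frame is assumed to be a Python list.

```python
def circular_right_shift(y, k):
    """Return the frame y shifted right by k samples, with wraparound.

    The first output sample is y[N - k], so the last k input samples
    move to the front and the remaining samples follow in order.
    """
    N = len(y)
    k %= N                                       # a shift by N is the identity
    return y[N - k:] + y[:N - k]
```

For example, two frames that each contain a single pulse at different positions become identical after one of them is shifted by the difference between the pulse positions.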

The distance measure and centroid calculation can be modified to exploit this behavior. 
First, the kth permutation of the ith training vector is defined as $\pi_k(z_i) = (\pi_k(y_i), h_i, \beta_i)$. 
The weighted Euclidean distance of Equation 2 is restated as 

$$
\tilde{d}(z_i, v_\gamma) = \min_{k = 0, \ldots, N-1} d(\pi_k(z_i), v_\gamma). \tag{7}
$$

The distance between a training vector and a cluster centroid is therefore defined as the 
minimum weighted Euclidean distance across all possible permutations of the input data. 
Having found the optimum partition by minimizing the average distortion, the centroid 
vector, $v_\gamma$, for centroid $\gamma$ can be determined by solving the matrix equation 

$$
\sum_{i=1}^{M} \pi_{k_i}(q_i) = \sum_{i=1}^{M} R_i\, v_\gamma. \tag{8}
$$

In this equation, $\pi_{k_i}$ is the optimum permutation for training vector $z_i$ and $M$ is the total 
number of training vectors in the cluster. 
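
The minimum over permutations can be sketched as a small wrapper around any frame distance. The sketch assumes the weighted speech frame is an N point list and takes the distance as a caller supplied function. The squared error used in the example is only a stand-in for the weighted Euclidean distance, and all names are hypothetical.

```python
def shift_min_distance(y, v, distance):
    """Return (smallest distance, best shift k) over all circular right shifts of y.

    y        : N point weighted speech frame
    v        : ensemble vector or cluster centroid
    distance : function of (shifted frame, v) returning a non-negative number
    """
    N = len(y)
    best_d, best_k = None, None
    for k in range(N):
        shifted = y[N - k:] + y[:N - k]          # k sample circular right shift
        d = distance(shifted, v)
        if best_d is None or d < best_d:
            best_d, best_k = d, k
    return best_d, best_k


def squared_error(a, b):
    """Stand-in distance used only for this illustration."""
    return sum((x - w) ** 2 for x, w in zip(a, b))
```

The shift returned for each training vector is the one that would be applied to that vector's data before the centroid for its cluster is recomputed.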


5 Conclusions 

This paper has introduced the nonhomogeneous excitation ensemble as a new type of en- 
semble used in a generic class of predictive speech coders. An iterative vector quantization 
procedure using a newly defined distance measure has been discussed as a means for pop- 
ulating the sequences for a specific nonhomogeneous ensemble. In this new procedure, the 
optimum choice of ensemble sequence is less dependent on the alignment of the excitation 
analysis frame with the original speech waveform. The procedure involves applying a set of 
circular permutations to the training data in order to time align similar events in different 
training vectors. The ensemble search procedure for this newly defined ensemble involves 
an exhaustive search, computing the weighted mean squared coding error for each circular 
permutation of each N point ensemble sequence. This is essentially equivalent to increasing 
the number of sequences in the ensemble from F sequences to FN sequences. However, 
the number of operations required to search this ensemble can be considerably reduced by 
using the recursive ensemble search procedure introduced in [3]. 